199 research outputs found

    Components and Interfaces of a Process Management System for Parallel Programs

    Full text link
    Parallel jobs are different from sequential jobs and require a different type of process management. We present here a process management system for parallel programs such as those written using MPI. A primary goal of the system, which we call MPD (for multipurpose daemon), is to be scalable. By this we mean that startup of interactive parallel jobs comprising thousands of processes is quick, that signals can be quickly delivered to processes, and that stdin, stdout, and stderr are managed intuitively. Our primary target is parallel machines made up of clusters of SMPs, but the system is also useful in more tightly integrated environments. We describe how MPD enables much faster startup and better runtime management of parallel jobs. We show how close control of stdio can support the easy implementation of a number of convenient system utilities, even a parallel debugger. We describe a simple but general interface that can be used to separate any process manager from a parallel library, which we use to keep MPD separate from MPICH.Comment: 12 pages, Workshop on Clusters and Computational Grids for Scientific Computing, Sept. 24-27, 2000, Le Chateau de Faverges de la Tour, Franc

    Hybrid static/dynamic scheduling for already optimized dense matrix factorization

    Get PDF
    We present the use of a hybrid static/dynamic scheduling strategy of the task dependency graph for direct methods used in dense numerical linear algebra. This strategy provides a balance of data locality, load balance, and low dequeue overhead. We show that the usage of this scheduling in communication avoiding dense factorization leads to significant performance gains. On a 48 core AMD Opteron NUMA machine, our experiments show that we can achieve up to 64% improvement over a version of CALU that uses fully dynamic scheduling, and up to 30% improvement over the version of CALU that uses fully static scheduling. On a 16-core Intel Xeon machine, our hybrid static/dynamic scheduling approach is up to 8% faster than the version of CALU that uses a fully static scheduling or fully dynamic scheduling. Our algorithm leads to speedups over the corresponding routines for computing LU factorization in well known libraries. On the 48 core AMD NUMA machine, our best implementation is up to 110% faster than MKL, while on the 16 core Intel Xeon machine, it is up to 82% faster than MKL. Our approach also shows significant speedups compared with PLASMA on both of these systems

    Krylov methods preconditioned with incompletely factored matrices on the CM-2

    Get PDF
    The performance is measured of the components of the key interative kernel of a preconditioned Krylov space interative linear system solver. In some sense, these numbers can be regarded as best case timings for these kernels. Sweeps were timed over meshes, sparse triangular solves, and inner products on a large 3-D model problem over a cube shaped domain discretized with a seven point template. The performance of the CM-2 is highly dependent on the use of very specialized programs. These programs mapped a regular problem domain onto the processor topology in a careful manner and used the optimized local NEWS communications network. The rather dramatic deterioration in performance was documented when these ideal conditions no longer apply. A synthetic workload generator was developed to produce and solve a parameterized family of increasingly irregular problems

    Domain-decomposed preconditionings for transport operators

    Get PDF
    The performance was tested of five different interface preconditionings for domain decomposed convection diffusion problems, including a novel one known as the spectral probe, while varying mesh parameters, Reynolds number, ratio of subdomain diffusion coefficients, and domain aspect ratio. The preconditioners are representative of the range of practically computable possibilities that have appeared in the domain decomposition literature for the treatment of nonoverlapping subdomains. It is shown that through a large number of numerical examples that no single preconditioner can be considered uniformly superior or uniformly inferior to the rest, but that knowledge of particulars, including the shape and strength of the convection, is important in selecting among them in a given problem
    corecore